ABSTRACT


It is common to estimate a distribution by means of a step function. Such estimates can be made continuous by connecting the left end points of the steps with straight line segments. In this paper, the minimum-risk estimator of this class is found for exponentially distributed data. Its risk is then compared with those of the sample distribution function and the Pyke estimator.

I. INTRODUCTION

The  motivation  behind  this  thesis  is  to  provide  a  contribution 
towards  the  objective  of  finding  the  best  way  to  estimate  an  unknown 
continuous  distribution.  The  cue  is  taken  from  References  4  and  5 
which  have  adopted  the  following  principles: 

(1)  The  estimator  itself  should  be  a  continuous  function. 

(2) The estimator should be simple and natural.

The  first  principle  has  not  been  generally  adopted.  The  modern 
textbook  solution  [Ref.  3]  is  to  use  the  sample  distribution  (step) 
function.  A  simple  and  natural  extension  of  the  step  function, 
satisfying  both  of  the  above  principles,  is  to  connect  the  left  end 
of  the  steps  with  straight  line  segments,  following  the  example  of  the 
so-called  Ogive  curve.  Thus,  the  class  of  estimators  can  be  described 
by: 

(a) plot the points $(C_r, X_r)$, $r = 1, \ldots, n$

(b) connect these points with straight line segments

where $0 < C_1 \le C_2 \le \cdots \le C_n \le 1$ are constants determined by some rule and $X_1 < X_2 < \cdots < X_n$ are the order statistics of a random sample of size n drawn from some unknown CDF, F. The only issue to be resolved is how to determine the sequence $\{C_r\}_{r=1}^{n}$.

Reference 6 suggests using $C_r = \frac{r}{n+1}$, referring to the result as the Pyke estimator, and shows that the expected squared error of a continuous Pyke estimator for some distribution, F, is no larger than the expected squared error of the sample distribution function for a sufficiently large sample size. It is also shown that the Pyke estimator strictly dominates the sample distribution function for sample sizes greater than or equal to one. The risk function used is the integrated expected squared error, a popular choice [Ref. 1].

In Ref. 4, the sequence $\{C_r\}_{r=1}^{n}$ for which the risk is minimized is found for the case where F is the uniform distribution. Comparisons of this optimal risk with those of the sample distribution function and the Pyke estimator are made. The result is that the risk of the Pyke estimator is significantly closer to the optimal than the risk of the sample distribution function.

The purpose of this thesis is to determine if the risk of the Pyke estimator remains closer to the optimal than that of the sample distribution function for a random sample drawn from the exponential distribution. The approach will be similar to that used in Ref. 4. In an attempt to provide continuity for the reader, the notation of Ref. 4 is used wherever possible.


II.  DETERMINATION  OF  THE  MINIMUM  RISK  COEFFICIENTS 


Let $C_0, C_1, \ldots, C_n, C_{n+1}$ be an increasing sequence with $C_0 \equiv 0$ and $C_{n+1} = 1$. The continuous function estimator for $F(x)$ can be defined by:

$$H(x) = C_r + \frac{x - X_r}{X_{r+1} - X_r}\,(C_{r+1} - C_r) \quad \text{for } X_r \le x \le X_{r+1},\quad r = 0, 1, \ldots, n, \tag{2.1}$$

where $X_1 < X_2 < \cdots < X_n$ are the order statistics from an absolutely continuous distribution $F(x)$. Assume that the population is lower bounded and contained in the interval $[0, \infty)$, then define $X_0 \equiv 0$, $X_{n+1} \equiv \infty$ and the risk function by:

$$R = E \int_0^{\infty} \left[ F(x) - H(x) \right]^2 dF(x). \tag{2.2}$$
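The definition (2.1) can be sketched in Python as follows. This is a minimal illustration, assuming the conventions stated above ($X_0 = 0$, $X_{n+1} = \infty$, $C_0 = 0$, $C_{n+1} = 1$); the function names `pyke_coefficients` and `make_estimator` are hypothetical and do not come from the thesis or its references.

```python
import bisect
import math


def pyke_coefficients(n):
    """Pyke coefficients C_r = r / (n + 1) for r = 1, ..., n."""
    return [r / (n + 1) for r in range(1, n + 1)]


def make_estimator(sample, coeffs):
    """Return H(x) as in (2.1): piecewise linear through the order statistics.

    Assumes a sample from a continuous distribution on [0, inf) (no ties),
    with X_0 = 0, X_{n+1} = inf, C_0 = 0 and C_{n+1} = 1.
    """
    xs = [0.0] + sorted(sample) + [math.inf]
    cs = [0.0] + list(coeffs) + [1.0]

    def H(x):
        if x <= 0.0:
            return 0.0
        # Largest r with xs[r] <= x, capped so that xs[r + 1] always exists.
        r = min(bisect.bisect_right(xs, x) - 1, len(xs) - 2)
        if math.isinf(xs[r + 1]):
            # Last interval: the slope term vanishes because X_{n+1} = inf.
            return cs[r]
        frac = (x - xs[r]) / (xs[r + 1] - xs[r])
        return cs[r] + frac * (cs[r + 1] - cs[r])

    return H


if __name__ == "__main__":
    H = make_estimator([2.0, 1.0], pyke_coefficients(2))
    for x in (0.5, 1.0, 1.5, 10.0):
        print(x, H(x))
```

Beyond the largest observation the sketch returns the constant $C_n$, which is what (2.1) gives when $X_{n+1}$ is taken to be infinite.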

$H(x)$ is defined piecewise according to which of the random intervals $(X_r, X_{r+1})$ contains x. Assume that the sample is extracted from an exponential distribution. For convenience, let the distribution's mean equal one. Let u, v with u < v be the variables in the sample space of $(X_r, X_{r+1})$. Their joint density function is:

$$f_{r,r+1}(u,v) = \frac{n!}{(r-1)!\,(n-r-1)!}\,\left(1 - e^{-u}\right)^{r-1} e^{-u}\, e^{-v(n-r)} \tag{2.3}$$

for $0 < u < v < \infty$ and $1 \le r \le n-1$.

The value of the mean does not matter since the risk does not change with linear transformations of the basic random variable. (Personal communication from Professor R.R. Read)


The end point densities can be derived as

$$f_{0,1}(0,v) = n e^{-vn}, \quad 0 < v < \infty \tag{2.4}$$

$$f_{n,n+1}(u,\infty) = n e^{-u}\left(1 - e^{-u}\right)^{n-1}, \quad 0 < u < \infty \tag{2.5}$$


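Define $\Delta_r = C_{r+1} - C_r$, then:

$$C_r = \sum_{j=0}^{r-1} \Delta_j .$$

Rewriting (2.2):

$$R = \sum_{r=0}^{n} \iiint_{0 \le u \le x \le v < \infty} \left[ 1 - e^{-x} - C_r - \frac{x-u}{v-u}\,\Delta_r \right]^2 e^{-x} f_{r,r+1}(u,v)\, du\, dv\, dx \tag{2.2'}$$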


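The minimum risk coefficients, $C_r$, can now be found using classical optimization techniques. Using the Lagrangian form, the problem becomes:

Minimize: $\Phi = R - \lambda \left( \sum_{j=0}^{n} \Delta_j - 1 \right)$

Subject to: $\sum_{j=0}^{n} \Delta_j = 1$

where $\lambda$ is the Lagrange multiplier. Thus, the approach becomes: set $\partial\Phi / \partial\Delta_k = 0$ for $k = 0, 1, \ldots, n$ and solve for the $C_r$:

$$\begin{aligned}
\frac{\partial \Phi}{\partial \Delta_k} = {} & -2 \iiint_{0 \le u \le x \le v < \infty} \left[ 1 - e^{-x} - C_k - \frac{x-u}{v-u}\,\Delta_k \right] \frac{x-u}{v-u}\; e^{-x} f_{k,k+1}(u,v)\, du\, dv\, dx \\
& -2 \sum_{r=k+1}^{n} \iiint_{0 \le u \le x \le v < \infty} \left[ 1 - e^{-x} - C_r - \frac{x-u}{v-u}\,\Delta_r \right] e^{-x} f_{r,r+1}(u,v)\, du\, dv\, dx \\
& - \lambda = 0
\end{aligned} \tag{2.6}$$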


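which after laborious but straightforward integration leads to

$$\begin{aligned}
& C_k \left[ 2(n-k)(n-k+1) - (2n-2k+1)(n-k+1)(n-k) \ln\frac{n-k+1}{n-k} \right] \\
& - C_{k+1} \left[ (n-k)(2n-2k+1) - 2(n-k)^2 (n-k+1) \ln\frac{n-k+1}{n-k} \right] \\
& + \sum_{r=k+1}^{n} C_r \left[ (n-r+1)(n-r) \ln\frac{n-r+1}{n-r} - (n-r+1) \right] \\
& - \sum_{r=k+1}^{n} C_{r+1} \left[ (n-r+1)(n-r) \ln\frac{n-r+1}{n-r} - (n-r) \right] \\
& = \frac{(n-k+2)(n-k+1)(n-k)}{4(n+2)} \ln\frac{n-k+2}{n-k} - \frac{(n-k+1)(n-k)}{2(n+2)} \\
& \quad - (n-k+1)(n-k) \ln\frac{n-k+1}{n-k} + (n-k) - \sum_{r=k+1}^{n} \frac{r+1}{n+2} - \frac{\lambda (n+1)}{2}
\end{aligned} \tag{2.7}$$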


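For k = n, (2.7) implies $\lambda = 0$ (the indeterminate form $-(n-k)\ln(n-k)$ is zero by use of (2.4) and (2.5)).

Using (2.5) in (2.6) for k = n provides:

$$\frac{\partial \Phi}{\partial \Delta_n} = \iint_{0 < u \le x < \infty} \left( \Delta_n - e^{-x} \right) e^{-x}\, n e^{-u} \left( 1 - e^{-u} \right)^{n-1} du\, dx = 0 \tag{2.8}$$

which implies:

$$C_n = \frac{n+1}{n+2} .$$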
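The step from (2.8) to the value of $C_n$ can be spelled out as follows. This is a worked sketch, assuming the integrand of (2.8) reads $(\Delta_n - e^{-x})\,e^{-x}\,n e^{-u}(1-e^{-u})^{n-1}$ with $\Delta_n = 1 - C_n$; the substitution variable $t$ is introduced here only for illustration.

```latex
% Inner integral over u (0 < u <= x): the density (2.5) integrates to the CDF of X_n.
\int_0^x n e^{-u}\left(1 - e^{-u}\right)^{n-1} du = \left(1 - e^{-x}\right)^n

% Substitute t = 1 - e^{-x}, dt = e^{-x} dx, so that e^{-x} = 1 - t:
\int_0^\infty \left(\Delta_n - e^{-x}\right) e^{-x} \left(1 - e^{-x}\right)^n dx
  = \int_0^1 \left(\Delta_n - 1 + t\right) t^n \, dt
  = \frac{\Delta_n - 1}{n+1} + \frac{1}{n+2} = 0

% Hence
1 - \Delta_n = \frac{n+1}{n+2}
\quad\Longrightarrow\quad
C_n = 1 - \Delta_n = \frac{n+1}{n+2}
```

For n = 1 this gives $C_1 = 2/3$, which agrees with the first minimum risk entry of Table I.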


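To simplify notation, let:

$$F_1(k) = (n-k+1)(n-k) \ln\frac{n-k+1}{n-k} - 2(n-k+2)(n-k+1) + 2(n-k+2)(n-k+1)^2 \ln\frac{n-k+2}{n-k+1}$$

$$F_2(k) = 2(n-k)(n-k+1) - (2n-2k+1)(n-k+1)(n-k) \ln\frac{n-k+1}{n-k}$$

$$F_3(k) = (n-k+1)(n-k) \ln\frac{n-k+1}{n-k} - (n-k+2)(n-k+1) \ln\frac{n-k+2}{n-k+1}$$

$$\begin{aligned}
H(k) = {} & \frac{(n-k+2)(n-k+1)(n-k)}{4(n+2)} \ln\frac{n-k+2}{n-k} - \frac{(n-k+1)(n-k)}{2(n+2)} \\
& - (n-k+1)(n-k) \ln\frac{n-k+1}{n-k} + (n-k) - \sum_{r=k+1}^{n} \frac{r+1}{n+2} - \frac{\lambda (n+1)}{2}
\end{aligned}$$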
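The coefficient functions can be evaluated numerically with a short Python sketch. It assumes the reading $F_1$, $F_2$, $F_3$ given by the bracketed coefficients of (2.7), and it treats the indeterminate product $(n-k)\ln\frac{n-k+1}{n-k}$ as zero at $k = n$. The helper name `xlog` and the argument order `(n, k)` are choices made here for illustration, not notation from the thesis.

```python
import math


def xlog(m):
    """(m + 1) * m * ln((m + 1) / m), taken as 0 in the limit m = 0."""
    return 0.0 if m == 0 else (m + 1) * m * math.log((m + 1) / m)


def F1(n, k):
    m = n - k
    return (xlog(m)
            - 2 * (m + 2) * (m + 1)
            + 2 * (m + 2) * (m + 1) ** 2 * math.log((m + 2) / (m + 1)))


def F2(n, k):
    m = n - k
    return 2 * m * (m + 1) - (2 * m + 1) * xlog(m)


def F3(n, k):
    m = n - k
    return xlog(m) - (m + 2) * (m + 1) * math.log((m + 2) / (m + 1))


if __name__ == "__main__":
    n = 6
    for k in range(1, n + 1):
        # Identity used when the system is reduced by row subtraction:
        # F3(k) - F1(k) = F2(k - 1).
        print(k, F3(n, k) - F1(n, k), F2(n, k - 1))
```

The two printed columns should agree to rounding error, which is a quick check that the three definitions were transcribed consistently.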
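Equation (2.7) may be written in matrix form:

$$A\,C = H \tag{2.9}$$

where

$$A = \begin{pmatrix}
F_1(1) & F_3(2) & F_3(3) & \cdots & F_3(n-1) & F_3(n) \\
F_2(1) & F_1(2) & F_3(3) & \cdots & F_3(n-1) & F_3(n) \\
0 & F_2(2) & F_1(3) & \cdots & F_3(n-1) & F_3(n) \\
0 & 0 & F_2(3) & \cdots & F_3(n-1) & F_3(n) \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & F_1(n-1) & F_3(n) \\
0 & 0 & 0 & \cdots & F_2(n-1) & F_1(n)
\end{pmatrix},
\qquad
C = \begin{pmatrix} C_1 \\ C_2 \\ \vdots \\ C_{n+1} \end{pmatrix}
\qquad \text{and} \qquad
H = \begin{pmatrix} H(0) \\ H(1) \\ \vdots \\ H(n) \end{pmatrix}.$$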

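Successive row subtraction with the last row and the last column discarded (since $C_n$ has already been determined) results in A becoming a tri-diagonal matrix of the form:

$$A' = \begin{pmatrix}
G_1(1) & G_2(1) & 0 & 0 & \cdots & 0 & 0 & 0 \\
G_2(1) & G_1(2) & G_2(2) & 0 & \cdots & 0 & 0 & 0 \\
0 & G_2(2) & G_1(3) & G_2(3) & \cdots & 0 & 0 & 0 \\
0 & 0 & G_2(3) & G_1(4) & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & G_1(n-3) & G_2(n-3) & 0 \\
0 & 0 & 0 & 0 & \cdots & G_2(n-3) & G_1(n-2) & G_2(n-2) \\
0 & 0 & 0 & 0 & \cdots & 0 & G_2(n-2) & G_1(n-1)
\end{pmatrix}$$

where

$$G_1(k) = F_1(k) - F_2(k)$$

$$G_2(k) = F_3(k) - F_1(k) = F_2(k-1)$$

and

$$H' = \begin{pmatrix} H'(1) \\ H'(2) \\ \vdots \\ H'(n-1) \end{pmatrix} \qquad \text{where } H'(k) = H(k-1) - H(k).$$

Equation (2.9) becomes:

$$A'\,C = H' \tag{2.9'}$$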

Equation (2.9') represents a second-order linear difference equation with variable coefficients. An explicit solution appears to be out of reach, forcing the use of numerical methods.

This matrix equation can be solved by triangular decomposition for any given sample size. The algorithm for this deterministic solution and the associated Fortran subroutine, TRID, are described in Ref. 2. Coefficients for various sample sizes were computed using TRID and are shown with the corresponding Pyke coefficients in Table I.
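For readers without access to TRID, a tri-diagonal system of this shape can be solved with the standard Thomas algorithm (forward elimination followed by back substitution). The sketch below is a generic stand-in written for illustration; it is not the TRID routine of Ref. 2, it does no pivoting, and it assumes the eliminated diagonal never becomes zero. The function name `solve_tridiagonal` and its argument layout are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tri-diagonal system by the Thomas algorithm.

    Row i reads: lower[i] * x[i-1] + diag[i] * x[i] + upper[i] * x[i+1] = rhs[i].
    lower[0] and upper[-1] are ignored. No pivoting is performed.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


if __name__ == "__main__":
    # Toy 3 x 3 system (not taken from the thesis) with known solution.
    print(solve_tridiagonal([0.0, 1.0, 1.0],
                            [2.0, 3.0, 2.0],
                            [1.0, 1.0, 0.0],
                            [4.0, 10.0, 8.0]))
```

To apply it to (2.9'), the diagonal would be filled with the $G_1$ values and the two off-diagonals with the $G_2$ values; the cost is linear in the number of unknowns.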
TABLE I. COEFFICIENTS

  N    r    C_r (MINIMUM RISK)    C_r (PYKE)

  1    1    0.6666667             0.5000000

  2    1    0.3383387             0.3333333
       2    0.7500000             0.6666667

  3    1    0.3432969             0.2500000
       2    0.4987318             0.5000000
       3    0.8000000             0.7500000

  4    1    0.2800868             0.2000000
       2    0.4240374             0.4000000
       3    0.5887811             0.6000000
       4    0.8333333             0.8000000

  5    1    0.2394752             0.1666667
       2    0.3578057             0.3333333
       3    0.5126778             0.5000000
       4    0.6460857             0.6666667
       5    0.8571429             0.8333333

  6    1    0.2085332             0.1428571
       2    0.3125745             0.2857143
       3    0.4437924             0.4285714
       4    0.5721471             0.5714286
       5    0.6906526             0.7142857
       6    0.8750000             0.8571429

  7    1    0.1848704             0.1250000
       2    0.2769643             0.2500000
       3    0.3940854             0.3750000
       4    0.5042591             0.5000000
       5    0.6200241             0.6250000
       6    0.7249480             0.7500000
       7    0.8888889             0.8750000

  8    1    0.1660087             0.1111111
       2    0.2488631             0.2222222
       3    0.3539021             0.3333333
       4    0.4534583             0.4444444
       5    0.5541501             0.5555556
       6    0.6579414             0.6666667
       7    0.7524714             0.7777778
       8    0.9000000             0.8888889

  9    1    0.1506598             0.1000000
       2    0.2259314             0.2000000
       3    0.3213671             0.3000000
       4    0.4115223             0.4000000
       5    0.5034373             0.5000000
       6    0.5946056             0.6000000
       7    0.6890570             0.7000000
       8    0.7749696             0.8000000
       9    0.9090909             0.9000000

 10    1    0.1379129             0.0909091
       2    0.2068969             0.1818182
       3    0.2943053             0.2727273
       4    0.3768931             0.3636364
       5    0.4608337             0.4545455
       6    0.5447461             0.5454545
       7    0.6284070             0.6363636
       8    0.7149642             0.7272727
       9    0.7937232             0.8181818
      10    0.9166667             0.9090909

 20    1    0.0747509             0.0476190
       2    0.1123907             0.0952381
       3    0.1599652             0.1428571
       4    0.2048848             0.1904762
       5    0.2505187             0.2380952
       6    0.2959665             0.2857143
       7    0.3414700             0.3333333
       8    0.3869660             0.3809524
       9    0.4324732             0.4285714
      10    0.4779891             0.4761905
      11    0.5235179             0.5238095
      12    0.5690642             0.5714286
      13    0.6146325             0.6190476
      14    0.6602411             0.6666667
      15    0.7058782             0.7142857
      16    0.7516879             0.7619048
      17    0.7973108             0.8095238
      18    0.8445265             0.8571429
      19    0.8874853             0.9047619
      20    0.9545455             0.9523810

 50    1    0.0315038             0.0196078
       2    0.0474483             0.0392157
       3    0.0675612             0.0588235
       4    0.0865575             0.0784314
       5    0.1058530             0.0980392
       6    0.1250684             0.1176471
       7    0.1443054             0.1372549
       8    0.1635367             0.1568627
       9    0.1827696             0.1764706
      10    0.2020022             0.1960784
      11    0.2212349             0.2156863
      12    0.2404678             0.2352941
      13    0.2597008             0.2549020
      14    0.2789339             0.2745098
      15    0.2981671             0.2941176
      16    0.3174005             0.3137255
      17    0.3366341             0.3333333
      18    0.3558678             0.3529412
      19    0.3751017             0.3725490
      20    0.3943358             0.3921569
      21    0.4135701             0.4117647
      22    0.4328046             0.4313725
      23    0.4520395             0.4509804
      24    0.4712746             0.4705882
      25    0.4905101             0.4901961
      26    0.5097460             0.5098039
      27    0.5289823             0.5294118
      28    0.5482191             0.5490196
      29    0.5674565             0.5686275
      30    0.5866945             0.5882353
      31    0.6059332             0.6078431
      32    0.6251728             0.6274510
      33    0.6444134             0.6470588
      34    0.6636552             0.6666667
      35    0.6828984             0.6862745
      36    0.7021433             0.7058824
      37    0.7213902             0.7254902
      38    0.7406397             0.7450980
      39    0.7598924             0.7647059
      40    0.7791493             0.7843137
      41    0.7984114             0.8039216
      42    0.8176810             0.8235294
      43    0.8369599             0.8431373
      44    0.8562559             0.8627451
      45    0.8755638             0.8823529
      46    0.8949449             0.9019608
      47    0.9142469             0.9215686
      48    0.9342227             0.9411765
      49    0.9523976             0.9607843
      50    0.9807692             0.9803922
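III. CALCULATION OF THE RISKS

From equation (2.2), the risk function for the minimum risk coefficients, R(min), is:

$$\begin{aligned}
R(\min) = {} & \sum_{r=0}^{n} \iiint_{0 \le u \le x \le v < \infty} \left[ 1 - e^{-x} - C_r - \frac{x-u}{v-u}\,\Delta_r \right]^2 e^{-x} f_{r,r+1}(u,v)\, du\, dv\, dx \\
= {} & \sum_{r=0}^{n} \Bigg\{ \frac{(C_r - 1)^2}{n+1} + \frac{2(C_r - 1)(n-r+1)}{(n+2)(n+1)} \\
& + \Delta_r\, \frac{(n-r+2)(n-r+1)(n-r)}{2(n+2)(n+1)} \left[ \ln\frac{n-r+2}{n-r} - \frac{2}{n-r+2} \right] \\
& + 2\Delta_r (C_r - 1)\, \frac{(n-r+1)(n-r)}{n+1} \left[ \ln\frac{n-r+1}{n-r} - \frac{1}{n-r+1} \right] \\
& + \Delta_r^2\, \frac{(n-r+1)(n-r)}{n+1} \left[ \frac{2n-2r+1}{n-r+1} - 2(n-r) \ln\frac{n-r+1}{n-r} \right] \\
& + \frac{(n-r+1)(n-r+2)}{(n+3)(n+2)(n+1)} \Bigg\}
\end{aligned} \tag{3.1}$$

The risk function for the Pyke estimator, R(Pyke), is: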


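$$\begin{aligned}
R(\text{Pyke}) = {} & \sum_{r=0}^{n} \iiint_{0 \le u \le x \le v < \infty} \left[ 1 - e^{-x} - \frac{r}{n+1} - \frac{x-u}{v-u} \left( \frac{1}{n+1} \right) \right]^2 e^{-x} f_{r,r+1}(u,v)\, du\, dv\, dx \\
= {} & \sum_{r=0}^{n} \Bigg\{ \frac{5(n-r)^2 + 5(n-r) + 1}{(n+1)^3} - \frac{(n-r+1)(3n-3r+2)}{(n+1)^2 (n+2)} \\
& + \frac{(n-r+1)(n-r+2)}{(n+3)(n+2)(n+1)} - \frac{2(n-r)(n-r+1)(2n-2r+1)}{(n+1)^3} \ln\frac{n-r+1}{n-r} \\
& + \frac{(n-r)(n-r+1)(n-r+2)}{2(n+1)^2 (n+2)} \ln\frac{n-r+2}{n-r} \Bigg\}
\end{aligned} \tag{3.2}$$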
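The closed form for the risk can be evaluated directly for any coefficient sequence. The Python sketch below assumes the following reading of (3.1): each term of the sum consists of $(C_r-1)^2/(n+1)$, $2(C_r-1)(n-r+1)/((n+2)(n+1))$, the three logarithmic terms in $\Delta_r$, and $(n-r+1)(n-r+2)/((n+3)(n+2)(n+1))$, with the logarithmic terms taken as zero at $r = n$. The function name `risk` is hypothetical. Substituting $C_r = r/(n+1)$ gives the Pyke risk of (3.2).

```python
import math


def risk(coeffs):
    """Integrated expected squared error for coefficients C_1, ..., C_n.

    Evaluates the term-by-term sum of (3.1) for unit-mean exponential data,
    with C_0 = 0 and C_{n+1} = 1 appended.
    """
    n = len(coeffs)
    c = [0.0] + list(coeffs) + [1.0]
    total = 0.0
    for r in range(n + 1):
        m = n - r
        cr = c[r]
        dr = c[r + 1] - c[r]
        total += (cr - 1) ** 2 / (n + 1)
        total += 2 * (cr - 1) * (m + 1) / ((n + 2) * (n + 1))
        total += (m + 1) * (m + 2) / ((n + 3) * (n + 2) * (n + 1))
        if m > 0:  # at m = 0 the m * ln(...) products are taken as zero
            l1 = math.log((m + 1) / m)
            l2 = math.log((m + 2) / m)
            total += (dr * (m + 2) * (m + 1) * m / (2 * (n + 2) * (n + 1))
                      * (l2 - 2 / (m + 2)))
            total += (2 * dr * (cr - 1) * (m + 1) * m / (n + 1)
                      * (l1 - 1 / (m + 1)))
            total += (dr ** 2 * (m + 1) * m / (n + 1)
                      * ((2 * m + 1) / (m + 1) - 2 * m * l1))
    return total


if __name__ == "__main__":
    print(risk([0.5]))                # Pyke coefficients, n = 1
    print(risk([1 / 3, 2 / 3]))       # Pyke coefficients, n = 2
    print(risk([2 / 3]))              # Table I coefficients, n = 1
    print(risk([0.3383387, 0.75]))    # Table I coefficients, n = 2
```

Under this reading, the four printed values can be compared with the n = 1 and n = 2 rows of Table II (PYKE RISK and MINIMUM RISK columns).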
The risk function for the sample distribution function, R(SDF), is:

$$R(\text{SDF}) = \sum_{r=0}^{n} \iiint_{0 \le u \le x \le v < \infty} \left[ 1 - e^{-x} - \frac{r}{n} \right]^2 e^{-x} f_{r,r+1}(u,v)\, du\, dv\, dx = \frac{1}{6n} . \tag{3.3}$$
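The value 1/(6n) in (3.3) can be checked by simulation. The sketch below relies on the fact that the integrated squared error is unchanged by the transformation t = F(x), so uniform samples can stand in for exponential ones; on each interval between consecutive order statistics the step function equals r/n and the integral of $(t - r/n)^2$ is available in closed form. The function names, the replication count and the seed are arbitrary choices made for illustration.

```python
import random


def sdf_integrated_squared_error(u_sorted):
    """Exact integral over [0, 1] of (t - F_n(t))^2 for sorted uniform data."""
    n = len(u_sorted)
    knots = [0.0] + list(u_sorted) + [1.0]
    total = 0.0
    for r in range(n + 1):
        level = r / n  # value of the step function between knots r and r + 1
        lo, hi = knots[r], knots[r + 1]
        total += ((hi - level) ** 3 - (lo - level) ** 3) / 3.0
    return total


def monte_carlo_sdf_risk(n, replications=20000, seed=0):
    """Average the integrated squared error over simulated samples of size n."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(replications):
        u = sorted(rng.random() for _ in range(n))
        acc += sdf_integrated_squared_error(u)
    return acc / replications


if __name__ == "__main__":
    n = 5
    print(monte_carlo_sdf_risk(n), 1.0 / (6 * n))
```

The simulated average should lie close to 1/(6n), with the agreement improving as the number of replications grows.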
IV. RESULTS AND CONCLUSIONS

Values of each type of risk were computed for various sample sizes using (3.1), (3.2) and (3.3) and are shown in Table II and Figure 1. There it becomes apparent that the risk of the Pyke estimator is significantly closer to the optimum than is the risk of the sample distribution function. All the risks converge to zero at the rate 1/n. Thus, n times the risk converges to a constant (1/6). It is interesting to compare the risks as they approach this asymptote. This is done in Table III and Figure 2.
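The scaling used for Table III, and the sense in which the Pyke risk is "closer to the optimum", can be illustrated with a few rows of Table II. In the sketch below the three rows are copied from Table II; the helper names and the "gap fraction" (the share of the distance between the minimum risk and the SDF risk that the Pyke estimator leaves uncovered) are introduced here for illustration and are not quantities defined in the thesis.

```python
# (minimum risk, Pyke risk, SDF risk) for selected sample sizes, from Table II.
TABLE_II = {
    10: (0.0125036, 0.0128436, 0.0166667),
    50: (0.0031184, 0.0031444, 0.0033333),
    100: (0.0016108, 0.0016181, 0.0016667),
}


def scaled_risks(n, risks):
    """n times each risk, the quantity tabulated in Table III."""
    return tuple(n * value for value in risks)


def pyke_gap_fraction(min_risk, pyke_risk, sdf_risk):
    """(Pyke - minimum) / (SDF - minimum): 0 is optimal, 1 is no better than SDF."""
    return (pyke_risk - min_risk) / (sdf_risk - min_risk)


if __name__ == "__main__":
    for n, risks in sorted(TABLE_II.items()):
        print(n, scaled_risks(n, risks), pyke_gap_fraction(*risks))
```

For these rows the scaled values stay below 1/6 and approach it from below as n grows.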

These results, coupled with those of Ref. 4, suggest that, given the criterion of minimizing expected squared error, the Pyke estimator should be used in lieu of the sample distribution function, particularly if the underlying population is suspected to be either exponential or uniform.
TABLE II. VALUE OF RISK FUNCTIONS FOR SAMPLE SIZE N

   N    MINIMUM RISK    PYKE RISK    SDF RISK

   1    0.0480993       0.0682656    0.1666667
   2    0.0363529       0.0394270    0.0833333
   3    0.0276659       0.0298089    0.0555556
   4    0.0233801       0.0246450    0.0416667
   5    0.0203317       0.0212262    0.0333333
   6    0.0180346       0.0187209    0.0277778
   7    0.0162237       0.0167779    0.0238095
   8    0.0147538       0.0152155    0.0208333
   9    0.0135338       0.0139269    0.0185185
  10    0.0125036       0.0128436    0.0166667
  12    0.0108566       0.0111197    0.0138889
  14    0.0095964       0.0098068    0.0119048
  16    0.0086000       0.0087725    0.0104167
  18    0.0077919       0.0079360    0.0092593
  20    0.0071232       0.0072455    0.0083333
  30    0.0049866       0.0050497    0.0055556
  40    0.0038370       0.0038755    0.0041667
  50    0.0031184       0.0031444    0.0033333
  60    0.0026266       0.0026453    0.0027778
  70    0.0022689       0.0022830    0.0023810
  80    0.0019969       0.0020079    0.0020833
  90    0.0017832       0.0017921    0.0018519
 100    0.0016108       0.0016181    0.0016667
Figure 1
TABLE III. VALUE OF THE RISK FUNCTIONS X SAMPLE SIZE

   N    N(MINIMUM RISK)    N(PYKE RISK)    N(SDF RISK)

   1    0.0480993          0.0682656       0.1666667
   2    0.0727058          0.0788540       0.1666667
   3    0.0829976          0.0894267       0.1666667
   4    0.0935202          0.0985800       0.1666667
   5    0.1016586          0.1061310       0.1666667
   6    0.1082079          0.1123254       0.1666667
   7    0.1135658          0.1174451       0.1666667
   8    0.1180301          0.1217241       0.1666667
   9    0.1218044          0.1253423       0.1666667
  10    0.1250362          0.1284358       0.1666667
  12    0.1302796          0.1334367       0.1666667
  14    0.1343494          0.1372958       0.1666667
  16    0.1375995          0.1403596       0.1666667
  18    0.1402547          0.1428488       0.1666667
  20    0.1424646          0.1449103       0.1666667
  30    0.1495975          0.1514916       0.1666667
  40    0.1534788          0.1550203       0.1666667
  50    0.1559200          0.1572188       0.1666667
  60    0.1575977          0.1587195       0.1666667
  70    0.1588217          0.1598089       0.1666667
  80    0.1597542          0.1606358       0.1666667
  90    0.1604884          0.1612848       0.1666667
 100    0.1610815          0.1618077       0.1666667